Abstract



This paper presents studies on a determin- 
istic annealing algorithm based on quantum 
annealing for variational Bayes (QAVB) in- 
ference, which can be seen as an extension of 
the simulated annealing for variational Bayes 
(SAVB) inference. QAVB is as easy as SAVB 
to implement. Experiments revealed QAVB 
finds a better local optimum than SAVB in 
terms of the variational free energy in latent 
Dirichlet allocation (LDA). 



1 Introduction 

Several studies relating machine learning to quantum mechanics have
recently been conducted. The main idea behind them is a
generalization of the probability distribution obtained by using a
density matrix, which is a self-adjoint positive-semidefinite matrix
of trace one. Wolf (2006) connects the basic probability rule of
quantum mechanics, called the "Born rule", which formulates a
generalized probability by using a density matrix, to spectral
clustering and other machine learning algorithms based on spectral
theory. Crammer and Globerson (2006) combined a margin maximization
scheme with a probabilistic modeling approach by incorporating the
concepts of quantum detection and estimation theory (Helstrom, 1969).
Tanaka and Horiguchi (2002) proposed a quantum Markov random field
using a density matrix and quantum mechanics and applied it to image
restoration.



Generalizing a Bayesian framework based on a density matrix has also
been proposed. Schack et al. (2001) proposed a "quantum Bayes rule"
for conditional density between two probability spaces. Warmuth et al.
generalized the Bayes rule to treat the case where the prior is a
density matrix (Warmuth, 2005) and unified Bayesian probability
calculus for density matrices with rules for translation between
joints and conditionals (Warmuth, 2006). Typically, the formulas
derived by quantum mechanics generalization retain the conventional
theory as a special case when the density matrices are diagonal.
Computing the full posterior distributions over model parameters for
probabilistic graphical models, e.g., latent Dirichlet allocation
(Blei et al., 2003), remains difficult in these quantum Bayesian
frameworks, as well as in classical Bayesian frameworks. In this
paper, we generalize variational Bayes inference (Attias, 1999), which
is a widely used framework for probabilistic graphical models, based
on ideas that have been used in quantum mechanics.

Variational Bayes (VB) inference has been widely used 
as an approximation of Bayesian inference for proba- 
bilistic models that have discrete latent variables. For 
example, in a probabilistic mixture model, such as a 
mixture of Gaussians, each data point is assigned to 
a latent class, and a latent variable corresponding to 
a data point indicates the latent class. VB is an opti- 
mization algorithm that minimizes the cost function. 
The cost function, called the negative variational free 
energy, is a function of latent variables. We have called 
the cost function "energy" in this paper. 

Since VB is a gradient algorithm similar to the Expectation
Maximization (EM) algorithm, it suffers from the problem of local
optima in practice. Deterministic annealing (DA) algorithms based on
simulated annealing (SA) (Kirkpatrick et al., 1983) have been proposed
for the EM algorithm (Ueda and Nakano, 1995) and VB (Katahira et al.,
2008) to overcome the issue of local optima. We call simulated
annealing based VB "SAVB". SA is one of the best-known physics-based
approaches to machine learning. SA is based on a concept of
statistical mechanics called "temperature". We decrease the
"temperature" parameter gradually in SA. Because the energy landscape
becomes flat at high temperature, it is easy to change the state
(see Fig. 1(a)). However, the state is trapped at low temperature
because of the energy barriers, and the transition probability
becomes very low. Therefore, SA does not necessarily find a global
optimum under a practical cooling schedule of temperature T. In
physics, quantum annealing (QA) has attracted attention as an
alternative annealing method for optimization problems that uses a
process analogous to quantum fluctuations (Apolloni et al., 1989;
Kadowaki and Nishimori, 1998; Santoro et al., 2002). QA is expected
to help states avoid being trapped by poor local optima at low
temperatures.

The main point of this paper is to explain the novel 
DA algorithm for VB based on the QA (QAVB) we 
derived and present the effects of QAVB we obtained 
through experiments. QAVB is a generalization of VB 
and SAVB attained by using a density matrix. We 
describe our motivation for deriving QAVB in terms 
of a density matrix in Section 3. Here, we overview
the QAVB that we derived. Interestingly, although 
QAVB is generalized and formulated by a density ma- 
trix, the algorithm for QAVB we finally derived does 
not need operations for a density matrix such as eigen- 
value decomposition and only has simple changes from 
the SAVB algorithm. 

Since SAVB does not necessarily find a global optimum, we still need
to run multiple SAVBs independently with different random
initializations, where m denotes the number of SAVBs. Here, let us
consider running multiple SAVBs dependently, not independently, where
"dependently" means that we run multiple SAVBs while introducing
interaction f among neighboring SAVBs that are randomly numbered,
such as j-1, j and j+1 (see Fig. 1(b)). In Fig. 1, $\sigma_j$
indicates the latent class states of N data points in the j-th SAVB.
The independent SAVBs have a very low transition probability among
states, i.e., they have been trapped, at low temperature as shown in
Fig. 1(c), while the dependent SAVBs can change the state in that
situation. This is because interaction f starts from zero (i.e.,
"independent"), gradually increases, and makes $\sigma_{j-1}$ and
$\sigma_j$ approach each other; the state will then be moved into
$\sigma^*$. If there is a better state around the sub-optimal states
that the independent SAVBs find, the dependent SAVBs are expected to
work well. The dependent SAVBs are just QAVB, where interaction f
and the above scheme are derived from QA mechanisms, as will be
explained in the following section.

This paper is organized as follows. In Section 2, we introduce the
notation used in this paper. In Section 3, we motivate QAVB in terms
of a density matrix. Sections 4 and 5 explain how we derive QAVB and
present the experimental results in latent Dirichlet allocation
(LDA). Section 6 concludes this paper.

2 Preliminaries 

We assume that we have N data points, and they are assigned to K
latent classes. The latent class of the i-th data point is denoted by
the latent variable $z_i$. $z_i = k$ indicates that the latent class
of the i-th data point is k. The latent class of the i-th data point
is also denoted by a K-dimensional binary indicator vector
$\tilde{\sigma}_i$ where, if $z_i$ is equal to k, the k-th element of
$\tilde{\sigma}_i$ is equal to 1 and the other elements are all equal
to 0. The number of available class assignments of all data points is
$K^N$. The class assignment of all data points is denoted by the
$K^N$-dimensional binary indicator vector
$\sigma = \bigotimes_{i=1}^{N} \tilde{\sigma}_i$, where $\otimes$ is
the Kronecker product, which is a special case of a tensor product.
If A is a k-by-l matrix and B is an m-by-n matrix, then the Kronecker
product $A \otimes B$ is the km-by-ln block matrix

$$A \otimes B = \begin{pmatrix} a_{11}B & \cdots & a_{1l}B \\ \vdots & \ddots & \vdots \\ a_{k1}B & \cdots & a_{kl}B \end{pmatrix}.$$

For example, if K = 2, N = 2, $z_1 = 1$ ($\tilde{\sigma}_1 = (1,0)^T$)
and $z_2 = 2$ ($\tilde{\sigma}_2 = (0,1)^T$), then
$\sigma = \tilde{\sigma}_1 \otimes \tilde{\sigma}_2 = (0,1,0,0)^T$.

Let $x = (x_1, \cdots, x_N)$ denote the N observed data points and
$\theta$ denote the model parameters. $\sigma^{(l)}$ indicates the
l-th latent class state of the $K^N$ available latent class states.
For example, if K = 2 and N = 2, then $\sigma^{(1)} = (1,0,0,0)^T$,
$\sigma^{(2)} = (0,1,0,0)^T$, $\sigma^{(3)} = (0,0,1,0)^T$ and
$\sigma^{(4)} = (0,0,0,1)^T$. The set of available latent class
states is denoted by
$\Sigma = \{\sigma^{(l)} \mid l = 1, 2, \cdots, K^N\}$.
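The indicator-vector notation above can be sketched in a few lines of Python. This is only an illustration under our own naming: the helpers `indicator`, `kron`, `joint_state` and `state_index` are hypothetical and do not come from the paper. It builds the $K^N$-dimensional vector $\sigma$ from the labels $(z_1, \cdots, z_N)$ and reports which $\sigma^{(l)}$ it equals.

```python
from functools import reduce


def indicator(z, K):
    """K-dimensional one-hot vector for a class label z in 1..K."""
    return [1 if k == z else 0 for k in range(1, K + 1)]


def kron(u, v):
    """Kronecker product of two vectors stored as flat lists."""
    return [a * b for a in u for b in v]


def joint_state(labels, K):
    """K**N-dimensional indicator vector for the labels (z_1, ..., z_N)."""
    return reduce(kron, (indicator(z, K) for z in labels))


def state_index(labels, K):
    """1-based index l such that joint_state(labels, K) equals sigma^(l)."""
    index = 0
    for z in labels:
        index = index * K + (z - 1)
    return index + 1


# Example from the text: K = 2, N = 2, z_1 = 1, z_2 = 2.
sigma = joint_state([1, 2], K=2)
print(sigma)                     # [0, 1, 0, 0]
print(state_index([1, 2], K=2))  # 2
```

The explicit $K^N$-dimensional vector is useful only for tiny examples; its size grows exponentially in N.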

3 Motivation for QAVB in terms of 
Density matrix 

For those unfamiliar with quantum information processing, we will
explain a density matrix, which can be used as an extension of
conventional probability. Our definition of a density matrix is based
on (Warmuth, 2006).



Figure 1: (a) Schematic picture of SAVB. (Upper panel) At low
temperature, the state often falls into local optima. (Bottom panel)
At high temperature, since the energy landscape becomes flat, the
state can change over a wide range. (b) and (c) Schematic picture of
QAVB. (b) QAVB connects neighboring SAVBs. (c) $\sigma_j$ can reach
$\sigma^*$ owing to the interaction f. It seems to go through the
energy barrier.

A density matrix is a self-adjoint positive-semidefinite matrix whose
trace is one. Conventional probability, which we call classical
statistics, can be expressed by a diagonal density matrix as follows.
For example, let us consider the case of two data points and two
latent classes, as in Section 2. We define four states, denoted by
indicator vectors $\{\sigma^{(l)}\}_{l=1}^{4}$, and probability vector
$p = (p_1, p_2, p_3, p_4)^T$, where $p_l$ indicates the occurrence
probability of the l-th state $\sigma^{(l)}$.

Then, the density matrix of this system is given by

$$\Phi = \mathrm{diag}\{p_1, p_2, p_3, p_4\} = \sum_{l=1}^{4} p_l \, \sigma^{(l)} \sigma^{(l)T}, \qquad (1)$$



where $\mathrm{diag}\{\cdot\}$ indicates a diagonal matrix. We can
extend the concept of probability by introducing non-diagonal
elements in a density matrix, which is called quantum statistics. A
state of a system in quantum statistics is defined by a unit (column)
real vector^1, $u$, where the dyad $uu^T$ has trace one,
$\mathrm{Tr}(uu^T) = \mathrm{Tr}(u^T u) = 1$. A density matrix,
$\Phi$, generalizes a finite probability distribution and can be
defined as a mixture of dyads,

$$\Phi = \sum_i p_i \, u_i u_i^T, \qquad (2)$$

where $p_i$ is a mixture proportion (coefficient) that is
non-negative and sums to one. $p_i$ specifies the proportion of the
system in state $u_i$. A density matrix assigns a probability to the
unit vector or its associated dyad given by
$p(u) = \mathrm{Tr}(\Phi uu^T) \; (= u^T \Phi u)$. This is called the
"Born rule" in quantum mechanics. According to Gleason's theorem,
there is a one-to-one correspondence between generalized probability
distributions and density matrices (Gleason, 1957). For example, when
a state vector is $u = (\frac{1}{2}, 0, \frac{\sqrt{3}}{2}, 0)^T$, it
represents the mixture of the first state and the third state with
probability $(\frac{1}{2})^2 = \frac{1}{4}$ and
$(\frac{\sqrt{3}}{2})^2 = \frac{3}{4}$, respectively.

^1 A state vector generally does not need to be restricted to a real
vector. If we consider a complex vector, the definition of the trace
of a dyad is replaced by $\mathrm{Tr}(uu^*) = \mathrm{Tr}(u^*u) = 1$,
where $u^*$ indicates the complex conjugate of $u$. However, for
simplicity, we restrict ourselves to real vectors in this paper.
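As a small numerical illustration of the Born rule $p(u) = \mathrm{Tr}(\Phi uu^T)$, the sketch below assumes the state vector $u = (1/2, 0, \sqrt{3}/2, 0)^T$ and compares the single-dyad density matrix $uu^T$ with the diagonal (classical) density matrix that has the same mixture proportions. The helper names (`outer`, `mat_mul`, `trace`, `born_probability`) are our own and purely illustrative.

```python
import math


def outer(u, v):
    """Dyad u v^T as a nested list."""
    return [[a * b for b in v] for a in u]


def mat_mul(A, B):
    """Product of two matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def trace(A):
    return sum(A[i][i] for i in range(len(A)))


def born_probability(Phi, u):
    """Born rule: p(u) = Tr(Phi u u^T), which equals u^T Phi u."""
    return trace(mat_mul(Phi, outer(u, u)))


u = [0.5, 0.0, math.sqrt(3) / 2, 0.0]
Phi_quantum = outer(u, u)  # a single dyad, with off-diagonal elements
Phi_classical = [[0.25, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0.75, 0], [0, 0, 0, 0]]

# Probabilities of the four indicator states sigma^(1), ..., sigma^(4).
basis = [[1.0 if i == l else 0.0 for i in range(4)] for l in range(4)]
print([round(born_probability(Phi_quantum, s), 6) for s in basis])
print([round(born_probability(Phi_classical, s), 6) for s in basis])

# The two matrices differ once we ask about a non-indicator vector.
print(round(born_probability(Phi_quantum, u), 6))
print(round(born_probability(Phi_classical, u), 6))
```

Both matrices assign the same probabilities to the indicator states; they differ only for vectors such as $u$ itself, which is where the off-diagonal elements matter.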

A probabilistic model employs uncertainty to model phenomena, and has
demonstrated its practicality in many scientific fields. Although
classical statistics involves uncertainty over mixture proportions
($\{p_i\}$), it restricts state vectors to indicator vectors
($\{\sigma^{(l)}\}$). In contrast, quantum statistics involves
uncertainty over not only mixture proportions ($\{p_i\}$) but also
state vectors ($\{u_i\}$), because if density matrix $\Phi$ has
off-diagonal elements, state vectors $\{u_i\}$ can be arbitrary
vectors. Therefore, a probabilistic model based on quantum statistics
is a more generalized model in terms of uncertainty, and the
generalization is expected to be more useful. In the same way, since
classical VB inference, including SA variants, only involves
uncertainty over mixture proportions, this paper proposes a method of
maintaining uncertainty over state vectors.

Finally, Fig. 2 sums up the relationship between VB, SAVB, and QAVB
in terms of a density matrix. SAVB and QAVB control the uncertainty
of mixture proportions via temperature T. In addition, QAVB can
control the uncertainty of state vectors by introducing the quantum
effect parameter $\Gamma$, which is described in Section 4, leading
to enhanced generalization.



4 Quantum Annealing for Variational 
Bayes Inference 

This section explains how we derive update equations for QAVB. First,
we define the lower bound of the marginal likelihood in QAVB as in
typical VB. Then, we apply the Suzuki-Trotter expansion (Trotter,
1959; Suzuki, 1976) to the marginal likelihood of QA to analytically
obtain update equations.

Figure 2: The uncertainty over mixture proportions has been well
studied in machine learning. VB and SAVB also only involve
uncertainty over mixture proportions. We study the uncertainty over
another component of a density matrix, state vectors. QAVB involves
uncertainty over not only mixture proportions but also state vectors.

4.1 Introducing Quantum Effect 

We define $\mathcal{H}_c$ as the following $K^N$ by $K^N$ diagonal
matrix:

$$\mathcal{H}_c = \mathrm{diag}\{-\log p(x, \sigma^{(1)}), \cdots, -\log p(x, \sigma^{(K^N)})\}. \qquad (3)$$



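The conditional probability of indicator state vector $\sigma$ given
$x$ is calculated by

$$p(\sigma \mid x) = \frac{p(x, \sigma)}{p(x)} = \frac{\sigma^T e^{-\mathcal{H}_c} \sigma}{\mathrm{Tr}(e^{-\mathcal{H}_c})} = \mathrm{Tr}(\Phi_c \, \sigma\sigma^T), \qquad (4)$$

where $\Phi_c = \frac{e^{-\mathcal{H}_c}}{\mathrm{Tr}(e^{-\mathcal{H}_c})}$
is a density matrix.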



The marginal log-likelihood of N data points is formulated as

$$\log p(x) = \log \mathrm{Tr}\{e^{-\mathcal{H}_c}\}. \qquad (5)$$

Since the full conditional posteriors are intractable, VB inference
has been proposed as an approximate algorithm for estimating them.

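The marginal log-likelihood $\log p(x)$ can be lower bounded by
introducing distributions over latent variables $\sigma$ and
parameters $\theta$, i.e., the approximate distribution
$q(\sigma)q(\theta)$ of the posterior distribution
$p(\sigma, \theta \mid x)$, as follows.

$$\log p(x) \geq \sum_{\sigma} \int q(\sigma) q(\theta) \log \frac{p(x, \sigma, \theta)}{q(\sigma) q(\theta)} \, d\theta \qquad (6)$$
$$\equiv F[q(\sigma), q(\theta)]. \qquad (7)$$

We maximize $F[q(\sigma), q(\theta)]$ with respect to $q(\sigma)$ and
$q(\theta)$ to obtain a better approximation of
$p(\sigma, \theta \mid x)$ in VB inference. $F[q(\sigma), q(\theta)]$
is called the variational free energy.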
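The bound in Eqs. (6) and (7) can be checked numerically on a toy case. The sketch below makes a simplifying assumption that is not in the paper: the parameters $\theta$ are dropped, so the model is just a table of hypothetical joint values $p(x, \sigma^{(l)})$ over four states. The function name `free_energy` is ours.

```python
import math


def free_energy(joint, q):
    """F[q] = sum_l q_l * log(p(x, sigma^(l)) / q_l), skipping q_l = 0."""
    return sum(ql * math.log(pl / ql) for pl, ql in zip(joint, q) if ql > 0)


# Hypothetical joint values p(x, sigma^(l)) for a fixed observation x.
joint = [0.10, 0.05, 0.02, 0.03]
log_evidence = math.log(sum(joint))

uniform = [0.25, 0.25, 0.25, 0.25]
posterior = [pl / sum(joint) for pl in joint]

print(free_energy(joint, uniform) <= log_evidence)               # bound holds
print(math.isclose(free_energy(joint, posterior), log_evidence)) # bound is tight
```

The bound is tight exactly when q equals the posterior, which is why maximizing F improves the approximation.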

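We derive QAVB by maximizing the lower bound of the following
marginal log-likelihood:

$$\log p(x; \beta, \Gamma) = \log \mathrm{Tr}\{e^{-\beta \mathcal{H}}\}, \qquad (8)$$

where $\Gamma$ is the quantum effect parameter, $\beta$ is the inverse
temperature, i.e., $\beta = \frac{1}{T}$, and we define $\mathcal{H}$
as the following $K^N$ by $K^N$ matrix:

$$\mathcal{H} = \mathcal{H}_c + \mathcal{H}_q, \qquad (9)$$
$$\mathcal{H}_q = \sum_{i=1}^{N} \sigma_{x,i}, \quad \sigma_{x,i} = \Big(\bigotimes_{j=1}^{i-1} \mathbb{E}_K\Big) \otimes \sigma_x \otimes \Big(\bigotimes_{j=i+1}^{N} \mathbb{E}_K\Big), \qquad (10)$$
$$\sigma_x = \Gamma (\mathbb{E}_K - \mathbf{1}_K),$$

where $\mathbb{E}_K$ is the K by K identity matrix, $\mathbf{1}_K$ is
the K by K matrix whose elements are all one, and $\mathcal{H}_q$ is
a symmetric matrix. The above $\mathcal{H}$ is a standard setting for
QA (Kadowaki and Nishimori, 1998). The conditional probability of
$\sigma$ given $x$, $\beta$ and $\Gamma$ is calculated by

$$p(\sigma \mid x; \beta, \Gamma) = \frac{\sigma^T e^{-\beta \mathcal{H}} \sigma}{\mathrm{Tr}(e^{-\beta \mathcal{H}})} = \sigma^T \Phi_q \sigma = \mathrm{Tr}(\Phi_q \, \sigma\sigma^T), \qquad (11)$$

where $\Phi_q = \frac{e^{-\beta \mathcal{H}}}{\mathrm{Tr}(e^{-\beta \mathcal{H}})}$
is a density matrix.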

Note that $\mathcal{H}$ becomes diagonal if $\Gamma$ is zero, in
which case it reduces to $\mathcal{H}_c$, and the quantum
log-likelihood $\log p(x; \Gamma, \beta)$ in Eq. (8) becomes the
classical log-likelihood $\log p(x)$ in Eq. (5) if $\beta$ is one.

The following section explains how we derive approximate posterior
distributions that maximize the lower bound of
$\log p(x; \Gamma, \beta)$.

4.2 Derivation 



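Let $\sigma_j$ be one of the available class assignment states of N
data points, s.t. $\sigma_j \in \Sigma$. The class of the i-th data
point in $\sigma_j$ is denoted by $\tilde{\sigma}_{j,i}$, s.t.
$\sigma_j = \bigotimes_{i=1}^{N} \tilde{\sigma}_{j,i}$. It is
intractable to evaluate $\log \mathrm{Tr}\{e^{-\beta \mathcal{H}}\}$
because $\mathcal{H}$ is not diagonal. However, we can approximately
evaluate the trace of $e^{-\beta \mathcal{H}}$ by the Suzuki-Trotter
expansion as follows (see Appendix A) (Suzuki, 1976).

$$p(x; \Gamma, \beta) \simeq p(x; \Gamma, \beta, m) + O\Big(\frac{\beta^2}{m}\Big), \qquad (12)$$
$$p(x; \Gamma, \beta, m) = \sum_{\sigma_1} \cdots \sum_{\sigma_m} \prod_{j=1}^{m} p(x, \sigma_j)^{\frac{\beta}{m}} \, b^N e^{s(\sigma_j, \sigma_{j+1}) f(\beta, \Gamma)}, \qquad (13)$$
$$s(\sigma_j, \sigma_{j+1}) = \sum_{i=1}^{N} \delta(\tilde{\sigma}_{j,i}, \tilde{\sigma}_{j+1,i}), \quad f(\beta, \Gamma) = \log\Big(\frac{a+b}{b}\Big), \qquad (14)$$
$$a = \exp\Big(-\frac{\beta \Gamma}{m}\Big), \quad b = \frac{a}{K}\big(a^{-K} - 1\big), \qquad (15)$$

where $\delta(\tilde{\sigma}_{j,i}, \tilde{\sigma}_{j+1,i}) = 1$ if
$\tilde{\sigma}_{j,i} = \tilde{\sigma}_{j+1,i}$, and
$\delta(\tilde{\sigma}_{j,i}, \tilde{\sigma}_{j+1,i}) = 0$ otherwise.
We assume a periodic boundary condition, i.e.,
$\tilde{\sigma}_{m+1,i} = \tilde{\sigma}_{1,i}$. m is called the
Trotter number, and the above trace can be accurately evaluated in
the limit of $m \to \infty$. $\frac{1}{N} s(\sigma_j, \sigma_{j+1})$
indicates a similarity measure that takes values in [0,1], where
$\frac{1}{N} s(\sigma_j, \sigma_{j+1}) = 1$ when
$\sigma_j = \sigma_{j+1}$ and
$\frac{1}{N} s(\sigma_j, \sigma_{j+1}) = 0$ when $\sigma_j$ and
$\sigma_{j+1}$ are completely different.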
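To get a feel for the interaction $f(\beta, \Gamma) = \log((a+b)/b)$ with $a = \exp(-\beta\Gamma/m)$ and $b = \frac{a}{K}(a^{-K} - 1)$, the following sketch evaluates it for a few values of $\Gamma$, together with the similarity count $s$. The function names and the chosen numbers are illustrative assumptions; the direct formula can overflow for very large $K\beta\Gamma/m$, so only moderate values are used.

```python
import math


def interaction(beta, gamma, m, K):
    """f(beta, Gamma) = log((a + b) / b) for moderate K * beta * gamma / m."""
    a = math.exp(-beta * gamma / m)
    b = a * (a ** (-K) - 1.0) / K
    return math.log((a + b) / b)


def similarity(labels_j, labels_next):
    """s(sigma_j, sigma_{j+1}): number of data points with equal labels."""
    return sum(1 for u, v in zip(labels_j, labels_next) if u == v)


# f shrinks toward 0 for large Gamma and grows as Gamma decreases.
for gamma in (5.0, 2.0, 0.5):
    print(gamma, interaction(beta=1.0, gamma=gamma, m=10, K=20))

print(similarity([1, 1, 2], [1, 2, 2]))  # 2
```

This matches the description in the Introduction: with a decreasing $\Gamma$ schedule, the interaction starts near zero and gradually increases.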

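In the following, we derive the lower bound of
$\log p(x; \Gamma, \beta, m)$ by introducing the approximate
distributions $q(\sigma_j)$ and $q(\theta_j)$ $(j = 1, \cdots, m)$.

$$\log p(x; \Gamma, \beta, m) \geq F_c[m, \beta] + F_q[m, \beta], \qquad (16)$$
$$F_c[m, \beta] = \sum_{j=1}^{m} \sum_{\sigma_j} \int q(\sigma_j) q(\theta_j) \log \frac{p(x, \sigma_j, \theta_j)^{\beta_{\mathrm{eff}}}}{q(\sigma_j) q(\theta_j)} \, d\theta_j, \qquad (17)$$
$$F_q[m, \beta] = \sum_{j=1}^{m} \sum_{\sigma_j} \sum_{\sigma_{j+1}} q(\sigma_j) q(\sigma_{j+1}) \big(N \log b + s(\sigma_j, \sigma_{j+1}) f(\beta, \Gamma)\big), \qquad (18)$$

where $\beta_{\mathrm{eff}} = \frac{\beta}{m}$ is called the
effective inverse temperature. If $\beta_{\mathrm{eff}} = 1$,
$F_c[m, \beta]$ is the sum of m classical variational free energies,
i.e., $F_c[m, \beta_{\mathrm{eff}} = 1] = \sum_{j=1}^{m} F[q(\sigma_j), q(\theta_j)]$.
$F_q[m, \beta]$ becomes large as $\sigma_j$ and $\sigma_{j+1}$
approach each other. In practice, the Trotter number m indicates the
number of multiple SAVBs with different initializations, and
$q(\sigma_j)$ and $q(\theta_j)$ are the approximations of posterior
distributions in the j-th SAVB, where index $j = 1, \cdots, m$ is
randomly labeled. $f(\beta, \Gamma)$ indicates the interaction
between the j-th and the (j+1)-th SAVB.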

One problem crops up here. The class labels are not always consistent
between the j-th and the (j+1)-th SAVB, i.e., class label k in the
j-th SAVB does not always correspond to class label k in the (j+1)-th
SAVB because the initialization of SAVBs is not the same. For
example, assume that $(z_{j,1}, z_{j,2}, z_{j,3}) = (1, 1, 2)$ and
$(z_{j+1,1}, z_{j+1,2}, z_{j+1,3}) = (2, 2, 1)$, where $z_{j,i}$
denotes the latent class label of the i-th data point in the j-th
SAVB. In this situation, it can be said that class label 1 in the
j-th SAVB does not correspond to class label 1 but to class label 2
in the (j+1)-th SAVB.

Algorithm 1 Quantum Annealing for Variational Bayes Inference.
 1: Initialize inverse temperature $\beta_{\mathrm{eff}}$, quantum
    field $\Gamma$ and model parameters.
 2: for all iteration t such that $1 \leq t \leq L^{out}$, where
    $L^{out}$ denotes the number of outer iterations do
 3:   for j = 1, ..., m do
 4:     for all iteration l such that $1 \leq l \leq L^{in}$, where
        $L^{in}$ denotes the number of inner iterations do
 5:       for i = 1, ..., N do
 6:         VB-E step: Update $q(\tilde{\sigma}_{j,i})$ with Eq. (20)
 7:       end for
 8:       VB-M step: Update $q(\theta_j)$ with Eq. (21)
 9:     end for
10:   end for
11:   Compute $\rho$ with Eq. (22) and Eq. (23)
12:   Increase inverse temperature $\beta_{\mathrm{eff}}$ (if
      $\beta_{\mathrm{eff}} > 1$, $\beta_{\mathrm{eff}} = 1$), and
      decrease quantum field $\Gamma$.
13: end for

Let us introduce the projection $\rho_j$ in class labels to absorb
the difference of class labels between the j-th and the (j+1)-th
SAVB. $k' = \rho_j(k)$ indicates that k in the j-th SAVB corresponds
to k' in the (j+1)-th SAVB. In this way, we have
$\delta(\tilde{\sigma}_{j,i}, \tilde{\sigma}_{j+1,i}) = \sum_{k=1}^{K} \tilde{\sigma}_{j,i,k} \tilde{\sigma}_{j+1,i,\rho_j(k)}$,
where $\tilde{\sigma}_{j,i} = (\tilde{\sigma}_{j,i,1}, \cdots, \tilde{\sigma}_{j,i,K})^T$,
i.e., $\tilde{\sigma}_{j,i,k}$ takes 1 if $z_{j,i} = k$, and
otherwise 0. $q(\tilde{\sigma}_{j,i,k})$ denotes $q(z_{j,i} = k)$.
We have

$$F_q[m, \beta] = mN \log b + f(\beta, \Gamma) \sum_{j=1}^{m} \sum_{i=1}^{N} \sum_{k=1}^{K} q(\tilde{\sigma}_{j,i,k}) \, q(\tilde{\sigma}_{j+1,i,\rho_j(k)}). \qquad (19)$$



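Therefore, we obtain the following updates by taking the functional
derivatives of $F_c[m, \beta] + F_q[m, \beta]$ with respect to
$q(\tilde{\sigma}_{j,i,k})$ and $q(\theta_j)$, and equating them to
zero:

$$q(\tilde{\sigma}_{j,i,k}) \propto \exp\Big\{ \int q(\theta_j) \, \beta_{\mathrm{eff}} \log p(x, \sigma_j, \theta_j) \, d\theta_j + f(\beta, \Gamma) \big( q(\tilde{\sigma}_{j-1,i,\rho_{j-1}^{-1}(k)}) + q(\tilde{\sigma}_{j+1,i,\rho_j(k)}) \big) \Big\}, \qquad (20)$$
$$q(\theta_j) \propto p(\theta_j)^{\beta_{\mathrm{eff}}} \exp\Big\{ \sum_{\sigma_j} q(\sigma_j) \, \beta_{\mathrm{eff}} \log p(x, \sigma_j \mid \theta_j) \Big\}, \qquad (21)$$

where $\rho^{-1}$ is the inverse projection of $\rho$.
$q(\tilde{\sigma}_{j,i,k})$ indicates the probability that the latent
class of the i-th data point will be k in the j-th SAVB. As clarified
by Eq. (20), $q(\tilde{\sigma}_{j,i,k})$ approaches
$q(\tilde{\sigma}_{j-1,i,\rho_{j-1}^{-1}(k)})$ and
$q(\tilde{\sigma}_{j+1,i,\rho_j(k)})$ as $f(\beta, \Gamma)$
increases. Therefore, $f(\beta, \Gamma)$ works as the interaction
explained by Fig. 1(b).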
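The structure of the E-step update for $q(\tilde{\sigma}_{j,i,k})$ can be sketched for one data point as below. This is a simplified illustration rather than the authors' implementation: the classical term (the expectation of $\beta_{\mathrm{eff}} \log p(x, \sigma_j, \theta_j)$ under $q(\theta_j)$) is assumed to be precomputed and passed in as `scaled_log_joint`, class indices are 0-based, and all names are hypothetical.

```python
import math


def coupled_e_step(scaled_log_joint, q_prev, q_next, rho_prev_inv, rho_next, f):
    """One E-step update for a single data point i in the j-th replica.

    scaled_log_joint[k]: classical term for class k (assumed precomputed).
    q_prev, q_next:      responsibilities of point i in replicas j-1 and j+1.
    rho_prev_inv[k]:     class in replica j-1 matched to class k (0-based).
    rho_next[k]:         class in replica j+1 matched to class k (0-based).
    f:                   interaction strength f(beta, Gamma).
    """
    logits = [
        scaled_log_joint[k] + f * (q_prev[rho_prev_inv[k]] + q_next[rho_next[k]])
        for k in range(len(scaled_log_joint))
    ]
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


identity = [0, 1]
flat = [0.0, 0.0]  # the classical term has no preference
print(coupled_e_step(flat, [0.9, 0.1], [0.8, 0.2], identity, identity, f=0.0))
print(coupled_e_step(flat, [0.9, 0.1], [0.8, 0.2], identity, identity, f=2.0))
```

With f = 0 the update reduces to the uncoupled (SAVB-like) softmax; with f > 0 the responsibilities are pulled toward those of the neighboring replicas.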



4.3 Estimates of Class-Label Projection ρ



^4 t denotes the t-th iteration.



We estimate the class label projection, $\rho$, because such
projections represent implicit information. We estimate $\rho$ by
maximizing $F_c[m, \beta] + F_q[m, \beta]$. To be more precise, we
extract the pairs $(k, \rho_j(k))$ $(j = 1, \cdots, m)$ that maximize
$\sum_{j=1}^{m} \sum_{i=1}^{N} \sum_{k=1}^{K} q(\tilde{\sigma}_{j,i,k}) \, q(\tilde{\sigma}_{j+1,i,\rho_j(k)})$
in Eq. (19). This is called the "assignment problem", which is one of
the fundamental combinatorial optimization problems. Even though the
Hungarian algorithm solves the assignment problem with computational
complexity $O(K^3)$, we use the following approximation algorithm
whose computational complexity is $O(K^2)$:

$$\rho_j(k) = \operatorname*{argmax}_{k'} \sum_{i=1}^{N} q(\tilde{\sigma}_{j,i,k}) \, q(\tilde{\sigma}_{j+1,i,k'}), \qquad (22)$$
$$\rho_{j-1}^{-1}(k) = \operatorname*{argmax}_{k'} \sum_{i=1}^{N} q(\tilde{\sigma}_{j,i,k}) \, q(\tilde{\sigma}_{j-1,i,k'}). \qquad (23)$$

The $\rho_j$ above means that k in the j-th SAVB corresponds to the
k' in the (j+1)-th SAVB that has the highest correlation between
$(q(\tilde{\sigma}_{j,1,k}), \cdots, q(\tilde{\sigma}_{j,N,k}))$ and
$(q(\tilde{\sigma}_{j+1,1,k'}), \cdots, q(\tilde{\sigma}_{j+1,N,k'}))$.
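The argmax rule for $\rho_j$ can be sketched as follows. The function name `estimate_projection` and the 0-based class indices are our own choices, and the responsibilities are stored as N lists of length K. Note that this greedy rule does not guarantee that the result is a one-to-one mapping, unlike an exact assignment solver.

```python
def estimate_projection(q_j, q_next):
    """Greedy projection: rho[k] = argmax_k' sum_i q_j[i][k] * q_next[i][k'].

    q_j and q_next are N x K responsibility tables (lists of lists) for two
    neighboring replicas; classes are indexed from 0.
    """
    K = len(q_j[0])
    rho = []
    for k in range(K):
        scores = [
            sum(row_j[k] * row_n[kp] for row_j, row_n in zip(q_j, q_next))
            for kp in range(K)
        ]
        rho.append(max(range(K), key=scores.__getitem__))
    return rho


# Hard-assignment version of the example in Section 4.2:
# labels (1, 1, 2) in the j-th SAVB and (2, 2, 1) in the (j+1)-th SAVB.
q_j = [[1, 0], [1, 0], [0, 1]]
q_next = [[0, 1], [0, 1], [1, 0]]
print(estimate_projection(q_j, q_next))  # [1, 0]
```

In 0-based terms, the output says that class 0 of the j-th SAVB corresponds to class 1 of the (j+1)-th SAVB and vice versa, matching the label swap in the example.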

5 Experiments 

We applied SAVB and QAVB to latent Dirichlet allocation (LDA), which
is one of the most famous probabilistic graphical models (Blei et
al., 2003). We used the Reuters corpus^2 and the Medline corpus^3. We
randomly chose 1,000 documents from the Reuters corpus that had a
vocabulary of 12,788 items. We randomly chose 1,000 documents from
the Medline corpus that had a vocabulary of 14,252 items. We set the
number of topics of LDA to 20.

5.1 Annealing schedule 

The annealing schedules of temperature T (in practice, inverse
temperature $\beta = \frac{1}{T}$) and quantum effect parameter
$\Gamma$ exert a substantial influence on the SAVB and QAVB
processes. Although a certified schedule for temperature is well
known in Monte Carlo simulations (Geman and Geman, 1984), we have not
yet obtained any mathematically rigorous arguments for T and $\Gamma$
in SAVB and QAVB. Since interaction f is a function of $\Gamma$ and
$\beta$, we have to consider the schedule of f in practice.

In this paper, we use the annealing schedules
$\beta = \beta_0 r_{\beta}^{t}$ and
$\beta_{\mathrm{eff}} = \beta_{\mathrm{eff}0} \, r_{\beta_{\mathrm{eff}}}^{t}$ ^4
that Katahira et al. (2008) used. We also use the annealing schedule
for $\Gamma$, with initial value $\Gamma_0$, that Kadowaki and
Nishimori (1998) used. We tried the schedules of $\beta$ with
combinations of $\beta_0$ = 0.2, 0.4, 0.6 and 0.8, and $r_{\beta}$ =
1.05, 1.1 and 1.2 in SAVB. As a result, we observed that
$\beta_0 = 0.6$ and $r_{\beta} = 1.05$ created an effective schedule
in SAVB for LDA. An inverse temperature that was too low did not work
well in LDA. This observation was similar to that for SAVB with the
hidden Markov model (Katahira et al., 2008). Therefore, we set
$\beta_0 = \beta_{\mathrm{eff}0} = 0.6$ and
$r_{\beta} = r_{\beta_{\mathrm{eff}}} = 1.05$ in SAVB and QAVB. We
varied $\Gamma_0$; the schedules of $\beta$ and f are shown in
Fig. 3.
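The geometric schedule for the effective inverse temperature, combined with the cap at 1 from step 12 of Algorithm 1, can be sketched as below. The starting index t = 1 and the function name `beta_schedule` are assumptions made for illustration; the schedule for $\Gamma$ is not sketched here.

```python
def beta_schedule(beta0, rate, steps, cap=1.0):
    """Return [min(cap, beta0 * rate**t) for t = 1, ..., steps]."""
    return [min(cap, beta0 * rate ** t) for t in range(1, steps + 1)]


# Values chosen in the text: initial value 0.6 and rate 1.05.
schedule = beta_schedule(beta0=0.6, rate=1.05, steps=300)
first_capped = next(t for t, b in enumerate(schedule, start=1) if b >= 1.0)
print(schedule[:3])
print(first_capped)
```

Under these assumptions the cap is reached after roughly a dozen outer iterations, so most of the 300 outer iterations run at an effective inverse temperature of 1.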



5.2 Experimental results 



We ran QAVB five times in all experiments with a Trotter number, m,
of 10. The results from this experiment were the average of the
minimum negative variational free energy,
$\min_j \{-F[q(\sigma_j), q(\theta_j)]\}$, of each run. SAVB was
randomly restarted until it consumed the same amount of time as QAVB.
We ran five batches of SAVB, and each batch consisted of 20
repetitions of SAVB. The results from this experiment were the
average of the minimum negative variational free energy of all
batches. These experimental conditions for QAVB and SAVB enabled a
fair comparison of the two methods in terms of execution time. In
fact, the averaged execution times for QAVB (m = 10) and 20 SAVBs
were 20.5 and 22.3 h for Reuters, and 20.4 and 22.9 h for Medline. We
set the number of outer iterations to $L^{out} = 300$ in Step 2 of
Algorithm 1. The numbers of inner iterations we tried were
$L^{in}$ = 1, 5, 10 and 20 in SAVB. We found that $L^{in} = 20$ was
effective in SAVB for LDA. Therefore, we set $L^{in} = 20$ in SAVB
and QAVB for LDA.

Fig. 4 plots the averages of the minimum negative variational free
energy with the mean squared error for Reuters and Medline. In both
corpora, each of which has different properties, QAVB outperforms
SAVB for each $\Gamma_0$ because the introduction of a novel
uncertainty into a model, in this case LDA, works well. QAVB
approaches SAVB as $\Gamma_0$ increases because interaction f remains
nearly 0 in the limited number of iterations. Moreover, we observed
that QAVB worked well if interaction f > 0 after SAVBs find
sub-optimal states. We think fast schedules, i.e., small $\Gamma_0$,
did not perform well because the term with interaction f in Eq. (20)
is noisy [...] is not estimated accurately in the small number of
iterations.

^2 http://www.daviddlewis.com/resources/testcollections/reuters21578/
^3 http://www.nlm.nih.gov/pubs/factsheets/medline.html

Figure 3: Schedules for inverse temperature $\beta$ and interaction
f.

Figure 4: Comparison of QAVB and SAVB in Reuters (Top) and Medline
(Bottom). The horizontal axis is $\Gamma_0$. The vertical axis is the
average of the minimum energy, where low energy is preferable.

6 Conclusion 

We proposed quantum annealing for variational Bayes inference (QAVB).
QAVB is a generalization of the conventional variational Bayes (VB)
inference and simulated annealing based VB (SAVB) inference obtained
by using a density matrix that generalizes a finite probability
distribution. QAVB is as easy as SAVB to implement because QAVB only
has to add interaction f to multiple SAVBs, and only one parameter,
$\Gamma_0$, is added in practice. The computational complexity of
QAVB is larger than that of SAVB because QAVB looks like m parallel
SAVBs with interactions. However, we empirically demonstrated that
QAVB works better than SAVB that is randomly restarted until it uses
the same amount of time as QAVB in latent Dirichlet allocation (LDA).
In fact, it is typical to run SAVB many times because SAVB does not
necessarily find a global optimum and is trapped by poor local optima
at low temperature. In practice, the bottleneck in QAVB is the
computational complexity of the projection of class labels in
Section 4.3, which is a search problem for one nearest neighbor. An
improvement in this algorithm to project class labels would lead to a
more effective QAVB.

Finally, let us describe future work. We intend to investigate an
effective projection algorithm, other constructions of the quantum
effect $\mathcal{H}_q$, and a suitable schedule for the quantum field
$\Gamma$. We also plan to apply QAVB to other probabilistic models,
e.g., a mixture of Gaussians and the hidden Markov model.

A Details of the Suzuki-Trotter
Expansion

This section provides the details to derive Eq. (12) from Eq. (10).
If $A_1, \cdots, A_n$ are symmetric matrices, the Trotter product
formula (Trotter, 1959) is
$\exp\big(\sum_{i=1}^{n} A_i\big) = \big(\prod_{i=1}^{n} \exp(A_i/m)\big)^m + O\big(\frac{1}{m}\big)$.
Note that $\big(\prod_{i=1}^{n} \exp(A_i/m)\big)^m$ becomes equal to
$\exp\big(\sum_{i=1}^{n} A_i\big)$ in the limit of $m \to \infty$.
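The Trotter product formula can be checked numerically on a tiny case. The sketch below uses two hypothetical non-commuting symmetric 2-by-2 matrices and a truncated Taylor series for the matrix exponential; the matrices and all helper names are our own choices for illustration.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def mat_scale(A, c):
    return [[c * a for a in row] for row in A]


def mat_exp(A, terms=40):
    """Matrix exponential by a truncated Taylor series (small matrices only)."""
    result, term = identity(len(A)), identity(len(A))
    for l in range(1, terms):
        term = mat_scale(mat_mul(term, A), 1.0 / l)
        result = mat_add(result, term)
    return result


def trotter_error(A1, A2, m):
    """Max-entry error of (exp(A1/m) exp(A2/m))^m against exp(A1 + A2)."""
    exact = mat_exp(mat_add(A1, A2))
    step = mat_mul(mat_exp(mat_scale(A1, 1.0 / m)), mat_exp(mat_scale(A2, 1.0 / m)))
    approx = identity(len(A1))
    for _ in range(m):
        approx = mat_mul(approx, step)
    return max(abs(e - a) for re, ra in zip(exact, approx) for e, a in zip(re, ra))


A1 = [[0.5, 0.0], [0.0, -0.5]]  # diagonal part
A2 = [[0.0, 0.5], [0.5, 0.0]]   # off-diagonal part, does not commute with A1
errors = {m: trotter_error(A1, A2, m) for m in (1, 10, 100)}
print(errors)
```

The error shrinks roughly in proportion to 1/m, which is why a larger Trotter number gives a more accurate approximation of the trace.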

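Hence, letting $\sigma_1$ be the $K^N$-dimensional binary indicator
vector mentioned in Section 2, we have

$$\mathrm{Tr}\{e^{-\beta(\mathcal{H}_c + \mathcal{H}_q)}\} = \sum_{\sigma_1} \sigma_1^T e^{-\beta(\mathcal{H}_c + \mathcal{H}_q)} \sigma_1 \qquad (24)$$
$$= \sum_{\sigma_1} \sigma_1^T \Big( e^{-\frac{\beta}{m} \mathcal{H}_c} e^{-\frac{\beta}{m} \mathcal{H}_q} \Big)^m \sigma_1 + O\Big(\frac{\beta^2}{m}\Big). \qquad (25)$$

Then, by inserting the identity matrices
$\sum_{\sigma_j} \sigma_j \sigma_j^T = \mathbb{E}_{K^N}$ between the
product of m exponentials in Eq. (25),
$\mathrm{Tr}\{e^{-\beta \mathcal{H}}\}$ leads to

$$\mathrm{Tr}\{e^{-\beta \mathcal{H}}\} = \sum_{\sigma_1} \sum_{\sigma'_1} \cdots \sum_{\sigma_m} \sum_{\sigma'_m} \prod_{j=1}^{m} \sigma_j^T e^{-\frac{\beta}{m} \mathcal{H}_c} \sigma'_j \; \sigma'^T_j e^{-\frac{\beta}{m} \mathcal{H}_q} \sigma_{j+1} + O\Big(\frac{\beta^2}{m}\Big), \qquad (26)$$

with $\sigma_{m+1} = \sigma_1$. The expression above means that the
auxiliary variables
$\{\sigma_1, \sigma'_1, \sigma_2, \sigma'_2, \cdots, \sigma_m, \sigma'_m\}$
are marginalized out.

Here, we derive simpler expressions for
$\sigma_j^T e^{-\frac{\beta}{m} \mathcal{H}_c} \sigma'_j$ and
$\sigma'^T_j e^{-\frac{\beta}{m} \mathcal{H}_q} \sigma_{j+1}$. The
former derives the following expression directly from its definition,

$$\sigma_j^T e^{-\frac{\beta}{m} \mathcal{H}_c} \sigma'_j = e^{\frac{\beta}{m} \log p(x, \sigma_j)} \, \delta(\sigma_j, \sigma'_j), \qquad (27)$$

where $\delta(\sigma_j, \sigma'_j) = 1$ if $\sigma_j = \sigma'_j$ and
$\delta(\sigma_j, \sigma'_j) = 0$ otherwise. Next, we derive a
simpler expression for
$\sigma'^T_j e^{-\frac{\beta}{m} \mathcal{H}_q} \sigma_{j+1}$. Using
$(A \otimes B)(C \otimes D) = (AC) \otimes (BD)$,
$e^{A_1 + A_2} = e^{A_1} e^{A_2}$ when $A_1 A_2 = A_2 A_1$, and
$e^{-\frac{\beta}{m} \mathcal{H}_q} = \bigotimes_{i=1}^{N} e^{-\frac{\beta}{m} \sigma_x}$,
we have

$$\sigma'^T_j e^{-\frac{\beta}{m} \mathcal{H}_q} \sigma_{j+1} = \prod_{i=1}^{N} \tilde{\sigma}'^T_{j,i} \, e^{-\frac{\beta}{m} \sigma_x} \, \tilde{\sigma}_{j+1,i} = \prod_{i=1}^{N} \sum_{l=0}^{\infty} \frac{1}{l!} \Big(-\frac{\beta \Gamma}{m}\Big)^l \tilde{\sigma}'^T_{j,i} \{(\mathbb{E}_K - \mathbf{1}_K)\}^l \tilde{\sigma}_{j+1,i}. \qquad (28)$$

$\tilde{\sigma}'^T_{j,i} \{(\mathbb{E}_K - \mathbf{1}_K)\}^l \tilde{\sigma}_{j+1,i}$
is calculated as

$$\tilde{\sigma}'^T_{j,i} \{(\mathbb{E}_K - \mathbf{1}_K)\}^l \tilde{\sigma}_{j+1,i} = \tilde{\sigma}'^T_{j,i} \Big[ \mathbb{E}_K + \frac{1}{K} \{(1-K)^l - 1\} \mathbf{1}_K \Big] \tilde{\sigma}_{j+1,i} = \delta(\tilde{\sigma}'_{j,i}, \tilde{\sigma}_{j+1,i}) + \frac{1}{K} \{(1-K)^l - 1\}. \qquad (29)$$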

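Thus, we have

$$\sigma'^T_j e^{-\frac{\beta}{m} \mathcal{H}_q} \sigma_{j+1} = \prod_{i=1}^{N} \sum_{l=0}^{\infty} \frac{1}{l!} \Big(-\frac{\beta \Gamma}{m}\Big)^l \tilde{\sigma}'^T_{j,i} \{(\mathbb{E}_K - \mathbf{1}_K)\}^l \tilde{\sigma}_{j+1,i}$$
$$= \prod_{i=1}^{N} \Big[ e^{-\frac{\beta \Gamma}{m}} \delta(\tilde{\sigma}'_{j,i}, \tilde{\sigma}_{j+1,i}) + \frac{1}{K} \Big( e^{-\frac{(1-K) \beta \Gamma}{m}} - e^{-\frac{\beta \Gamma}{m}} \Big) \Big]$$
$$= \prod_{i=1}^{N} \{ a \, \delta(\tilde{\sigma}'_{j,i}, \tilde{\sigma}_{j+1,i}) + b \} = b^N e^{s(\sigma'_j, \sigma_{j+1}) \log(\frac{a+b}{b})}. \qquad (30)$$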
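The closed form above says that each single-site factor $e^{-\frac{\beta}{m} \sigma_x}$ has $a + b$ on its diagonal and $b$ elsewhere, with $a = \exp(-\beta\Gamma/m)$ and $b = \frac{a}{K}(a^{-K} - 1)$. The sketch below checks this numerically for small, arbitrarily chosen values, assuming $\sigma_x = \Gamma(\mathbb{E}_K - \mathbf{1}_K)$ with $\mathbb{E}_K$ the identity and $\mathbf{1}_K$ the all-ones matrix. The helper names are hypothetical and the exponential is a truncated Taylor series.

```python
import math


def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(A, terms=60):
    """Matrix exponential by a truncated Taylor series (small matrices only)."""
    n = len(A)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for l in range(1, terms):
        term = [[v / l for v in row] for row in mat_mul(term, A)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    return result


def site_factor(beta, gamma, m, K):
    """exp(-(beta/m) * gamma * (E_K - 1_K)), computed numerically."""
    c = beta * gamma / m
    # -(beta/m) * gamma * (E_K - 1_K) has 0 on the diagonal and c elsewhere.
    exponent = [[0.0 if i == j else c for j in range(K)] for i in range(K)]
    return mat_exp(exponent)


def closed_form(beta, gamma, m, K):
    a = math.exp(-beta * gamma / m)
    b = a * (a ** (-K) - 1.0) / K
    return a, b


beta, gamma, m, K = 1.0, 1.0, 4, 3
M = site_factor(beta, gamma, m, K)
a, b = closed_form(beta, gamma, m, K)
print(M[0][0], a + b)  # diagonal entry: same class in both replicas
print(M[0][1], b)      # off-diagonal entry: different classes
```

The diagonal-to-off-diagonal ratio $(a+b)/b$ is what becomes the interaction $f(\beta, \Gamma)$ after taking logarithms.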